[SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API #3091
davies wants to merge 3 commits into apache:master
davies commented Nov 4, 2014
cc @mengxr
Test build #22882 has started for PR 3091 at commit
Test build #22882 has finished for PR 3091 at commit
Test FAILed.
Test build #22886 has started for PR 3091 at commit
Test build #22886 has finished for PR 3091 at commit
Test PASSed.
What happens if r is JavaArray or JavaList but not pickleable? Are we expecting that downstream can handle it?
The caller will handle it. A JavaArray/JavaList is iterable in Python, so the caller can access the internal objects in the array/list.
Test build #22913 has started for PR 3091 at commit
Test build #22913 has finished for PR 3091 at commit
Test PASSed.
LGTM. Merged into master and branch-1.2. Thanks @davies!
```
pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
:: Experimental ::
If `observed` is a Vector, conduct Pearson's chi-squared goodness
of fit test of the observed data against the expected distribution,
or against the uniform distribution (by default), with each category
having an expected frequency of `1 / len(observed)`.
(Note: `observed` cannot contain negative values)
If `observed` is a matrix, conduct Pearson's independence test on the
input contingency matrix, which cannot contain negative entries or
columns or rows that sum up to 0.
If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
test for every feature against the label across the input RDD.
For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed.
All label and feature values must be categorical.
:param observed: it could be a vector containing the observed categorical
                 counts/relative frequencies, or the contingency matrix
                 (containing either counts or relative frequencies),
                 or an RDD of LabeledPoint containing the labeled dataset
                 with categorical features. Real-valued features will be
                 treated as categorical for each distinct value.
:param expected: Vector containing the expected categorical counts/relative
                 frequencies. `expected` is rescaled if the `expected` sum
                 differs from the `observed` sum.
:return: ChiSquaredTest object containing the test statistic, degrees
         of freedom, p-value, the method used, and the null hypothesis.
```
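To make the docstring above concrete, here is a minimal pure-Python sketch of the two statistics it describes: goodness of fit for a vector and independence for a contingency matrix. It is not the PySpark implementation and does not call Spark; the function names `chi_sq_goodness_of_fit` and `chi_sq_independence` are hypothetical, chosen for illustration only. The sketch assumes the standard Pearson formula, sum of `(observed - expected)^2 / expected`, and it returns only the statistic and degrees of freedom. The p-value is left out because the Python standard library has no chi-squared distribution function.

```python
def chi_sq_goodness_of_fit(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test.

    Hypothetical helper for illustration; not part of PySpark.
    """
    n = len(observed)
    if n < 2:
        raise ValueError("need at least two categories")
    if any(o < 0 for o in observed):
        raise ValueError("observed cannot contain negative values")
    total = float(sum(observed))
    if total <= 0:
        raise ValueError("observed must have a positive sum")
    if expected is None:
        # Uniform default: every category gets a 1 / len(observed) share.
        expected = [total / n] * n
    else:
        if len(expected) != n:
            raise ValueError("observed and expected differ in length")
        if any(e <= 0 for e in expected):
            raise ValueError("expected values must be positive")
        # Rescale expected so that its sum matches the observed sum.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1


def chi_sq_independence(matrix):
    """Return (statistic, degrees_of_freedom) for a contingency matrix.

    `matrix` is a list of equal-length rows. Hypothetical helper for
    illustration; not part of PySpark.
    """
    if any(v < 0 for row in matrix for v in row):
        raise ValueError("matrix cannot contain negative entries")
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    if any(s == 0 for s in row_sums + col_sums):
        raise ValueError("rows and columns cannot sum to 0")
    total = float(sum(row_sums))
    statistic = 0.0
    for i, row in enumerate(matrix):
        for j, o in enumerate(row):
            # Expected cell count under independence of rows and columns.
            e = row_sums[i] * col_sums[j] / total
            statistic += (o - e) ** 2 / e
    dof = (len(row_sums) - 1) * (len(col_sums) - 1)
    return statistic, dof


if __name__ == "__main__":
    print(chi_sq_goodness_of_fit([4, 6, 5]))
    print(chi_sq_independence([[10, 20], [20, 10]]))
```

The rescaling branch mirrors the `expected` parameter note in the docstring: passing `[1, 2, 3]` against observed counts `[10, 20, 30]` yields a statistic of zero, because only the proportions of `expected` matter.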
Author: Davies Liu <davies@databricks.com>
Closes #3091 from davies/his and squashes the following commits:
145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
(cherry picked from commit c8abddc)
Signed-off-by: Xiangrui Meng <meng@databricks.com>